
Conversation

@VeckoTheGecko
Contributor

@VeckoTheGecko commented Mar 27, 2025

Created the new OceanParcels website. Used https://xarray.dev as a starting point.

Made sure to bring example_data across in the migration, as that is how Parcels downloads its example datasets.
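
For context, Parcels fetches those example datasets over HTTP from the website's example_data directory, so the new site needs to keep serving those files. A rough sketch of the user-facing call that relies on this (the helper name download_example_dataset and the dataset name are from memory and may differ between Parcels versions):

# Rough sketch of how example data is fetched in user code; the helper name
# and the dataset name are illustrative and may differ between Parcels versions.
from parcels import download_example_dataset

# Downloads the dataset files from the website's example_data directory
# into a local folder and returns that folder's path.
data_folder = download_example_dataset("MovingEddies_data")
print(data_folder)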

Items still TODO:

  • Update "Projects that use parcels" section
  • Update features section

Fixes #112 (Revamp websites - which assets and pages do we keep?)

@VeckoTheGecko
Contributor Author

This is the final script I used to extract the article data and port it to the new site.

Details

"""Script to scrape articles from old OceanParcels website."""

import requests
from bs4 import BeautifulSoup
import json
import re
import sys


def scrape_articles(url):
    try:
        # Fetch the webpage
        response = requests.get(url)
        response.raise_for_status()  # Raise an exception for bad status codes

        # Parse the HTML content
        soup = BeautifulSoup(response.text, "html.parser")

        # Find all card elements
        cards = soup.find_all("div", class_="card")

        # List to store extracted article information
        articles = []

        # Process each card
        for card in cards:
            try:
                # Extract title from h5 element
                title_elem = card.find("h5")
                title = title_elem.get_text(strip=True) if title_elem else ""

                # Extract authors (text immediately after h5)
                authors = ""
                if title_elem and title_elem.next_sibling:
                    authors = (
                        title_elem.next_sibling.strip()
                        if isinstance(title_elem.next_sibling, str)
                        else ""
                    )

                # Extract published info (journal, volume, pages)
                published_info = ""
                if title_elem:
                    # Find all text between authors and <br/>
                    next_elem = title_elem.find_next_sibling()
                    while next_elem and next_elem.name != "br":
                        if isinstance(next_elem, str):
                            published_info += next_elem.strip() + " "
                        else:
                            published_info += next_elem.get_text(strip=True) + " "
                        next_elem = next_elem.next_sibling
                    published_info = published_info.strip()

                # Extract DOI link from the card-link element
                doi = ""
                doi_link = card.find(
                    "a", class_="card-link", href=lambda href: href and "doi" in href
                )
                if doi_link:
                    doi = doi_link.get("href", "")

                # Extract abstract from card-body
                card_body = card.find("div", class_="card-body")
                abstract = card_body.get_text(strip=True) if card_body else ""

                # Strip any trailing comma from the authors string
                authors = authors.rstrip(",")

                # Remove stray whitespace before commas in the published info
                published_info = re.sub(r"\s*,", ",", published_info)

                # Create article dictionary
                article = {
                    "title": title,
                    "published_info": published_info,
                    "authors": authors,
                    "doi": doi,
                    "abstract": abstract,
                }
                # Collapse newlines and following indentation into single spaces
                article = {k: re.sub(r"\n\s*", " ", v) for k, v in article.items()}

                articles.append(article)

            except Exception as card_error:
                print(f"Error processing card: {card_error}")
                print("Problematic card HTML:")
                print(card.prettify())
                sys.exit(1)

        # Make articles chronological
        articles.reverse()

        # Save to JSON file
        with open("articles.json", "w", encoding="utf-8") as f:
            json.dump(articles, f, indent=2, ensure_ascii=False)

        print(f"Successfully scraped {len(articles)} articles.")
        return articles

    except requests.RequestException as e:
        print(f"Error fetching URL: {e}")
        sys.exit(1)


# Main execution
if __name__ == "__main__":
    url = "https://oceanparcels.org/articles.html"
    scrape_articles(url)
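
A quick way to sanity-check the generated articles.json before dropping it into the new site (an illustrative snippet, not part of the scraper; the field names mirror the dictionary built above):

"""Sanity check of the scraped articles.json (illustrative, not part of the scraper)."""

import json

with open("articles.json", encoding="utf-8") as f:
    articles = json.load(f)

# Every article should carry the same keys the scraper writes out.
required = {"title", "published_info", "authors", "doi", "abstract"}
for i, article in enumerate(articles):
    missing = required - article.keys()
    if missing:
        print(f"Article {i} ({article.get('title', '')!r}) is missing: {missing}")

print(f"Checked {len(articles)} articles.")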

@VeckoTheGecko
Contributor Author

Updated view @erikvansebille

[screenshot of the updated site]

@VeckoTheGecko
Contributor Author

Let's merge on Monday :)

@VeckoTheGecko
Contributor Author

Should we remove the placeholder celebrating-10-years post, or do you want to write it before we merge, @erikvansebille?

@erikvansebille
Member

Good point, I just removed it. Will write it in the coming week(?)

But perhaps you(?) can write a short blog post celebrating the new website launch, highlighting that we thank xarray for the design?

@VeckoTheGecko
Contributor Author

But perhaps you(?) can write a short blog post celebrating the new website launch, highlighting that we thank xarray for the design?

done :)

@VeckoTheGecko merged commit e2b5afc into main Mar 31, 2025
3 checks passed
@VeckoTheGecko deleted the migration branch March 31, 2025 09:55